Skip to content

Squeeze performance out of the (post)processing stage of PDF build.#154

Merged
KubaO merged 44 commits into
twinbasic:mainfrom
KubaO:staging
May 24, 2026
Merged

Squeeze performance out of the (post)processing stage of PDF build.#154
KubaO merged 44 commits into
twinbasic:mainfrom
KubaO:staging

Conversation

@KubaO
Copy link
Copy Markdown
Collaborator

@KubaO KubaO commented May 24, 2026

(Post)Processing stage reads the PDF that Chrome wrote out, modifies it (mostly a fresh outline i.e. bookmarks), and writes it out. Our own outline processing takes about 10ms, most of the time (~1s) is pdf-lib's loading and then saving of the PDF.

The heap use and CPU time have been squeezed down as much as possible without changing pdf-lib's API.

KubaO added 30 commits May 23, 2026 23:48
PDFObjectParser.parseDict ends every dict it parses with four
PDFName.of calls for Type/Catalog/Pages/Page, even on the dicts
that have no /Type entry at all. With fast-decode-name in effect
each call collapses to a fastCache.get, but fastOf was still the
twinbasic#4 row in process.cpuprofile at ~5%.

Pool-dedup makes the canonical PDFNames reference-stable for the
whole load, so capture them once at shim-load and substitute
module-level constants for the four calls. Drops ~17 ms (~22%)
of fastOf self-time. Output byte-equivalent.
The sampling heap profile of the process phase showed `new Map()` +
Map.prototype.set at ~80 MB combined (50 % of total allocations on
the book), 80 % of that traffic from the parser's per-dict
accumulator. PDF dicts are tiny (typically <= 10 entries), so the
hash-table arena per dict was pure overhead.

The new fast-dict-array shim patches PDFDict's storage to a flat
alternating array, plus all prototype methods (set/get/has/delete/
keys/values/entries/asMap/clone/toString/sizeInBytes/copyBytesInto)
and the parser's parseDict hot loop. Subsumes fast-dict-iter +
fast-parse-dict, both of which stay in the tree as A/B baselines.

Wall-clock: process phase 1.18s -> 1.13s (4 runs paired no-profile).
Heap: Map+set builtin traffic 79 MB -> 15 MB.
After fast-dict-array shipped, PDFContext.assign's
`indirectObjects.set(ref, object)` was the only hot Map.set left in
the heap profile -- one set per parsed indirect object, ~14 MB of
hash-table growth on the book.

fast-indirect-objects patches assign / lookup / lookupMaybe / delete
/ getObjectRef / enumerateIndirectObjects to consult a dense array
keyed by objectNumber for gen=0 PDFRefs (the overwhelming common
case), Map fallback for gen!=0. Lazy-init on first assign; no
constructor patch needed. Mirror of the fast-refs trick on the value
side.

CPU: PDFContext.assign drops out of the process top-15.
Heap: set traffic 14.8 MB -> 7.7 MB (-48 %). Remaining 7 MB is the
upstream PDFRef pool.set on cache miss -- next target.
fast-refs' dense-array cache already short-circuited the LOOKUP
side, but on a gen=0 miss it still called through to upstream
PDFRef.of, which redundantly populated the upstream Map<string,
PDFRef> pool. After fast-indirect-objects shipped, that pool was
the last hot Map.set in the heap profile -- ~7 MB of growth-arena
churn from ~9 k unique-objectNumber misses on the book.

Replace the gen=0 miss path with Object.create(PDFRef.prototype)
+ manual field init. PDFObject (super) has a no-op constructor and
the only fields prototype methods read are objectNumber /
generationNumber / tag, so direct construction is safe.

CPU: PDFRef.of drops out of process top-15 (~93 ms saved).
Heap: set traffic 7.7 MB -> 0.5 MB. The residual 504 KB is PDFName
interning's fastCache.set, static-size and harmless. No more
materially-hot Map.set in the process-phase heap profile.
After the previous round of allocator-shape shims, parseNumberOrRef
was the next-largest row in the process-phase heap profile at
~15 MB -- mostly inlined `new PDFNumber(value)` from the parser's
number branch. PDFs reuse a handful of integer values constantly
(page indices, /Count, /N, /MediaBox dimensions), and PDFNumber
is conceptually immutable, so pooling by value is safe.

fast-pdfnumber-pool installs a dense-array cache for non-negative
integers in [0, 16384), Map fallback for floats / negatives. Same
shape as fast-refs. parseNumberOrRef's row collapses off the top
10; total process-phase heap traffic drops 123 MB -> 107 MB (-13 %).
PDFNumber row settles at 0.8 MB (the floor: one instance per unique
value).

Also lands find-heap-callees.mjs -- the children-of analyzer used
to investigate fastParseDictArray's mystery 58 MB self-row (turned
out to be recursive parseDict invocations across nesting levels,
intrinsic).
Instrumented parseDict shows 261k invocations on the book, 52% of
dicts at exactly 5 entries (10 push slots), 80% at 4-5 entries,
96% <= 7 entries, max recursion only 3. The original `const arr = []`
+ push-grow path was wasting ~85% of fastParseDictArray's 58 MB
self-row on FixedArray growth garbage (cap 4 -> 8 -> 16 doubling
discards on every 5-entry dict).

Allocate the accumulator at `new Array(10)` -- exact fit for the
median, 0 growth for 80% of dicts, one growth for the 5% at 7
entries. Direct indexing with len counter; push only on overflow.

Heap: total sampled 107 MB -> 92 MB (-14%); fastParseDictArray row
58 MB -> 44 MB (-25%). SCRATCH=10 beat SCRATCH=16 (saved 9.5 MB
more) because the cap-16 baseline was itself ~46 MB across 261k
calls.

Also lands instrument-parsedict.mjs and the --instrument-parsedict
flag on measure.mjs for future dict-workload investigations.
SCRATCH was a leftover name from when I was sketching the
parser-wide long-lived buffer approach (which we rejected). The
shipped const is just the initial capacity of the per-dict
backing array -- not a scratch buffer at all. Rename for clarity.

Also softens the notes' "What about a scratch buffer?" follow-up
section title to "What about a true scratch buffer?", and notes in
the code comment that this is a pre-sized permanent backing array,
not scratch.
Each PDFDict would carry a (buf, start, end) view into a parser-wide
per-depth shared array that is append-only across all dicts at that
depth. Eliminates per-dict array allocation: 261k PDFDict-backing-
arrays collapse to 3 shared buffers per parser (max parseDict
recursion depth = 3 on the book). Per-depth caps pre-sized to the
book's measured workload + slack, so V8 doesn't grow them.
Mutations (catalog.set during setOutline) copy-on-write to a
private array.

Recursion handling: a single global shared buffer would interleave
outer and inner entries when parseObject recurses; per-depth
buffers keep outer's range contiguous.

Bug found and fixed during prototyping: `if (!this._dictDepth)`
re-init guard fires every time _dictDepth returns to 0 (between
top-level dicts), defeating buffer sharing. Use explicit
`_dictBufs === undefined` check.

Net win is modest: ~2.5 MB heap reduction vs fast-dict-array
(92.13 MB -> 89.68 MB). The buffer-sharing savings (~88 B/dict)
are largely offset by the larger PDFDict instance (Object.create
+ 5 fields vs constructor-inlined + 2). Superseded by the
one-buffer + packed-pointer approach that follows, which shrinks
the PDFDict instance back down.

Code dropped; narrative kept in perf/notes/08-pdf-lib.md as the
thinking that led to one-buffer.
The fast-dict-view PDFDict instance was 5 fields (~96 B); packing
the lot into a single 53-bit Number `d` would shrink the instance
significantly. Reads via bitwise (fields below bit 32) or
arithmetic (Math.floor(d / 2**n) & mask) for higher fields. The
PDFContext is a singleton in our pipeline (one PDFDocument.load
per process), so the shim keeps it at module level; a second
distinct context throws.

PDFDict instance shrinks ~96 B -> ~24-32 B. PDFPageLeaf still
needs `normalized` + `autoNormalizeCTM` slots (~1.6 k page leaves
of 261 k total dicts, small fraction).

Heap: 89.7 MB -> 83.7 MB (-6 MB / -7%). GC self-time: 167 ms ->
129 ms (-23%). Cumulative arc from the original Map-backed PDFDict:
152 MB -> 84 MB (-45%).

Superseded by the one-buffer PDFDict approach that follows: keeps
the "packed into a Number" idea but moves entries into a single
per-parser mainBuf, folding bufIdx away and giving a tighter
bit layout.

Code dropped; narrative kept in perf/notes/08-pdf-lib.md.
Collapse PDFDict storage into one long-lived mainBuf. Recursion
gotcha solved with a two-area split: a small per-parser temp
array acts as a stack of parseDict recursion frames; each frame
appends to temp, then commits its frame to main in one
contiguous append, then pops temp back. Outer's frame stays
parked in temp while inner recurses, then resumes intact when
inner pops.

Owned dicts (factory-created post-parse, COW results) append to
mainBuf too. Mutations: in-place replace for existing keys, COW
(copy range to tail + push new pair) for new keys or delete.
setOutline's pattern (create-then-recurse-then-set) hits one COW
per dict; subsequent sets extend in place at the high-water mark.
~9k entry copies total for the book, negligible.

PDFContext is a singleton in our pipeline; a second distinct
context throws. Instance state packs into one 53-bit Number:
24 bit start + 14 bit length + 1 bit owned + 14 spare = 37 bits
used.

Heap: total process heap 92 MB -> 66 MB (-28% vs fast-dict-array).
Cumulative arc since Map-backed PDFDict: 152 MB -> 66 MB (-57%).
GC self-time bumps slightly (one big live array to scan);
wall-clock within noise.

Mutually exclusive with --fast-dict-array / --fast-dict-iter /
--fast-parse-dict. Production swaps render-book.mjs's
fast-dict-array import for fast-dict-onebuf; the legacy shims
stay in the tree as A/B baselines. The intermediate fast-dict-
view and fast-dict-double prototypes explored on the way to this
shape are documented as "explored, didn't ship" sections in
perf/notes/08-pdf-lib.md.
Walks the PDF grammar (indirect objects, dicts, arrays, names,
numbers, refs, strings, streams, ObjStms-with-inflate) without
instantiating any PDFObject. Counts only: 261k dicts, 2.34M dict
slots, 81k arrays, 750k ref appearances, max recursion depth 4.
The measure pass is preparation for a two-pass measure-allocate-
work architecture where mainBuf becomes a Float64Array sized
exactly to measured demand, eliminating V8's mark traversal of
its 2.4M Object-ref slots.

On perf/raw.pdf (39.3 MB Chrome output), measure pass runs in
135 ms (min of 5) vs PDFDocument.load at 1238 ms -- ratio 0.109,
~9x cheaper. Architecture cleared: even at 80% work-pass cost,
135 + 990 = 1125 ms vs current 1238 ms is net win on CPU before
any GC reduction.

Wired:
- perf/measure.mjs --dump-raw-pdf <path>: one-time flag that saves
  Chrome's raw page.pdf() output before pdf-lib processing.
- perf/raw.pdf (gitignored): canonical 39.3 MB input for measure /
  heap-profile investigations going forward.
- perf/phase0-measure.mjs: the prototype walker. Measurement-only;
  doesn't ship in any production path.
docs/lib/measure-pass.mjs productionises the Phase 0 walker as a
stand-alone library exporting measure(bytes) -> counts.
fast-dict-onebuf gains setExpectedDictSlots(slots) that resizes
the module-level main backing Array to exact measured demand.
perf/measure.mjs gains --measure-pass that wires the two together
before PDFDocument.load, with mutex checks against --incremental,
--render-only, and the (required) --fast-dict-onebuf.

Structural validation: byte-identical output (1651 pages, 1773
outline nodes, matching titles; 31-byte rawPdf-timestamp jitter
on the saved bytes).

A V8 inline-cache gotcha worth capturing: the first cut reassigned
the module binding (`let main; main = new Array(N)`) which broke
IC slots in every hot closure that read main. Heap profile showed
_appendEntries leaking 27 MB and total sampled jumping 65 -> 92 MB,
despite the resized array being identical in shape. Pre-filling
with arr.fill(null) didn't help (wasn't an element-kind issue).
Fix: keep the same Array identity, resize in place via
`main.length = N`. Heap regression collapses to +0.14 MB noise.
Lesson recorded in notes/08-pdf-lib.md: never rebind a module-level
value that hot closures specialise against, even if language
semantics allow it -- mutate in place.

Measured cost (paired, production shim set):
  measure-pass:  +60 ms (inline; 135 ms standalone Phase 0)
  load:          unchanged (within noise)
  net process:   +40 ms

Heap: flat. Phase 1 doesn't change what gets allocated, only the
initial capacity of the backing Array (which is module-load-time
cost, invisible in process-phase profiles).

Behind --measure-pass flag in the harness. NOT yet in
docs/render-book.mjs's production import chain -- no current
consumer wins anything back from the measured counts. The flag
exists so a later commit can flip it into production once the
architecture has another consumer.
The owned/shared flag at bit 38 only gated set's append path:
"extend in place at HWM only if owned." Re-reading the safety
argument shows this is over-cautious. Each parseDict commits a
contiguous frame to main and mainLen advances past it; no two
PDFDict instances share slots. So if a dict's range satisfies
start + length === mainLen, the slots past mainLen are free
regardless of how the range was created. The owned/shared
distinction the bit encoded doesn't correspond to anything the
safety check needs.

Changes:
- pack(start, length): third arg gone; no POW_38 OR-in.
- _owned and POW_38: deleted.
- _cow: collapses to one branch (was two paths that differed
  only in the owned-at-HWM early return).
- set: gate becomes just `start + length !== mainLen`.
- _makeFromRange: owned param gone.
- _ownedFromArray renamed _makeFromAppend for accuracy.
- Bit 38 is now spare; spare grows from 14 to 15 bits.

Net behavioural change: shared dicts that still abut the HWM
at first .set now extend in place instead of COWing -- ~5-10
slot copies avoided per such mutation. Tiny win in the right
direction.

Byte-identical output (1651 pages, 1773 outline nodes,
matching titles). Heap flat: 65.34 MB vs 65.27 MB baseline,
within noise. Top frames in the heap profile are structurally
the same.
Measurement-only tooling for Phase 2 design. fast-dict-onebuf
exports `main` and a `getMainLen()` getter so external consumers
can inspect the buffer. perf/instrument-slot-types.mjs walks
main[0..mainLen) after setOutline and classifies each slot by
PDFObject subtype, printing key/value counts and percentages.
perf/measure.mjs gains --instrument-slot-types that loads the
module and invokes the classifier (requires --fast-dict-onebuf;
not compatible with --incremental / --render-only).

Distribution on the book (production shim set + --measure-pass):
total slots = 2 358 630, keys = 1 179 315, values = 1 179 315.

  type           keys      key%       values    value%   total%
  -----------------------------------------------------------------
  PDFName        1179315   100.00%    493256    41.83%   70.91%
  PDFRef               0     0.00%    435217    36.90%   18.45%
  PDFNumber            0     0.00%    162325    13.76%    6.88%
  PDFArray             0     0.00%     79468     6.74%    3.37%
  PDFDict              0     0.00%      5660     0.48%    0.24%
  PDFHexString         0     0.00%      1776     0.15%    0.08%
  PDFString            0     0.00%      1601     0.14%    0.07%
  PDFBool.True         0     0.00%        12     0.00%   0.0005%
  PDFBool.False        0     0.00%         0     0.00%        0
  PDFNull              0     0.00%         0     0.00%        0

Key findings: (1) keys are 100% PDFName -- the even/odd invariant
holds. (2) Four big pools (Name, Ref, Number, Dict) cover 96.4%
of all slots; encoding them directly into the Float64 mainBuf
collapses ~96% of slot-mark traversals. (3) Side-pool fallback
for unpooled types (Array, String, HexString) is ~3.5% --
~82 800 slots that V8 would still mark, vs ~2.34M today. (4)
Nested PDFDicts as slot values are only 5 660 -- most dicts are
referenced via Ref rather than embedded. (5) Bool/Null/RawStream
in dict slots are essentially zero; tag-only encoding works.

Classification cost: 39ms (single pass over 2.36M slots).
The next architectural step from Phase 1 -- replace the Object[]
mainBuf with a Float64Array, encode every entry (key and value
alike) as a 4-bit type tag + 49-bit pool id / payload. Subsumes
fast-refs and fast-pdfnumber-pool by owning PDFRef.of and
PDFNumber.of with built-in pool-id assignment; adds new pools for
PDFArray (sequential id), PDFString and PDFHexString (value-dedup
since they're immutable). PDFDict slots encode the existing
38-bit (start, length) payload directly.

A trap worth recording: the first cut eagerly cached every
parse-created PDFDict in dictByPayload so decodeValue(TAG_DICT)
would return the same instance. That writes 261k Map entries
during parse; total heap went 65 -> 92 MB. Fix: lazy
materialization. Top-level dicts (226k) live in indirectObjects
and are never decoded via TAG_DICT; only nested dicts (~5660)
are. Caching on first access caps the Map at ~5660 entries.

Measured result on faraday: wash.
  process wall      1.16 s -> 1.18 s  (~+20 ms, noise)
  GC self-time      151 ms -> 149 ms  (~0 ms)
  heap allocation   65 MB  -> 68 MB   (+3 MB from new pool Maps)
  marked main slots 2.34 M -> 0       (architectural win, no $$)

The slot-mark-cost win is real but mainBuf wasn't the bottleneck
-- pointer-array marks are fast in V8. Encoding overhead roughly
cancels the savings.

Code dropped. Faraday kept it as opt-in foundation for Phase 3,
but Phase 3 also doesn't ship, so the dependency chain doesn't
earn its keep on staging. Design notes preserved in
perf/notes/08-pdf-lib.md as the takeaway worth keeping.
Mirrors Phase 2's PDFDict shape onto PDFArray: each instance
becomes a view into a shared arrayBuf Float64Array via
this.d = packed (start, length), with the same temp-then-commit
parseArray pattern and the same 4-bit-tag + 49-bit-payload slot
encoding. Bit budget: 24 + 15 = 39 bits per array (one more
length bit than dict).

Measured result on faraday: heap win + CPU regression.

  process duration  1.09 s -> 1.45 s  (+360 ms, +33 %)
  GC self-time      149 ms -> 144 ms  (flat)
  heap sampled      65 MB  -> 58 MB   (-7.6 MB, -12 %)
  parseArray row    19.6 MB -> 0      (out of top 10)
  structural        byte-identical    (1651 pages, 1773 outline)

The heap win is real: 79k PDFArrays stop allocating backing
arrays. The CPU regression is mostly per-slot decode during save
-- PDFDict.copyBytesInto + PDFArray.copyBytesInto together
iterate ~3M slots, each calling decodeValue (10-case switch +
pool lookup). V8 doesn't inline decodeValue across the prototype
boundary; ~100 ns x 3M = ~300 ms.

Code dropped (depended on Phase 2's fast-dict-encoded.mjs, also
not on staging). Design notes preserved in
perf/notes/08-pdf-lib.md; the follow-up Phase 3β attempts to
recover most of the 300 ms by hand-inlining the common decode
cases at the hot call sites.
…t ship.

The Phase 3 CPU regression was almost entirely per-slot decode
during save -- PDFDict.copyBytesInto + sizeInBytes +
PDFArray.copyBytesInto + sizeInBytes iterate ~3M slots, each
calling decodeValue (10-case switch + pool lookup). V8 doesn't
inline across the prototype-method boundary; ~100 ns x 3M ~=
+300 ms.

(β) hand-inlines decodeValue's switch into all four hot methods.
The switch body is copy-pasted verbatim into each loop, giving
V8 a monomorphic call site per case branch.

Measured deltas vs Phase 3 (pre-inline) on the book:
  (garbage collector)      144 ms -> 130 ms  (-14 ms win)
  PDFObjectParser.parseName 106 ms -> 70 ms  (-36 ms win, surprise)
  PDFDict.copyBytesInto      57 ms -> 49 ms  (-8 ms)
  fastParseDictEncoded       59 ms -> 63 ms  (+4 ms)
  heap sampled              57.8 MB ~  58.0 MB (flat)
  structural                byte-identical (1651 pages, 1773 outline)

parseName's drop is a downstream effect of V8 re-optimizing the
call graph once the hot copyBytesInto / sizeInBytes paths became
monomorphic per case branch.

Net of full Phase 2 + 3 + β vs P1 baseline (fast-dict-onebuf):
  heap                65.4 MB -> 58.0 MB  (-7.4 MB, -11 %)
  GC self-time        149 ms -> 130 ms   (-19 ms, -13 %)
  CPU residual        ~+200 ms across many frames (noisy)

Architectural conclusion: Float64Array encoded storage works
correctly and delivers a real heap+GC win, but the per-slot
encoding overhead exceeds the slot-mark savings. V8 marks
pointer arrays faster than assumed (~10-20 ms for 2.4M slots,
not 100+). The original Object[] polymorphic .copyBytesInto()
was actually fine; replacing it with explicit switch dispatch
helps GC and parseName but hurts dict-side hot loops.

Code dropped (depended on Phase 2 / Phase 3 architecture, also
not on staging). Notes preserved in perf/notes/08-pdf-lib.md as
the endpoint of this storage-shape exploration. The next move
on the same theme is the much narrower "one-buffer for
PDFArray" (fast-array-onebuf), which mirrors fast-dict-onebuf's
shape directly and does ship.
KubaO added 14 commits May 24, 2026 01:57
Wires docs/lib/measure-pass.mjs in front of PDFDocument.load in
docs/render-book.mjs. The no-allocate walker counts dictSlots once,
fast-dict-onebuf pre-sizes its main Array to the exact count
(main.length = N), and V8 growth resizes during load go away.

Net wall-clock on the book is ~+40 ms (walker ~60 ms, load saves
~20). That's the smallest of the four Phases evaluated and the
only one whose tradeoff is acceptable shipping: Phase 2 is a
regression, Phase 3 / 3β recover most of it but only for a ~7 MB
heap win that doesn't justify the CPU cost. Phase 1's bound is
mostly insurance -- mainBuf was over-allocating by ~60 K slots out
of 2.4 M -- but it lays the plumbing for any future shape change
to ship without re-doing the wiring.

Also adds --measure-pass to both canonical commands in
perf/README.md and a flag-rationale entry parallel to
fast-dict-onebuf's.
Mirror of fast-dict-onebuf's strategy applied to PDFArray. Every
committed element lives in a single append-only `arrayMain` JS
Array, kept for the document's lifetime. Each PDFArray instance is
a view via packed (start, length) in `d`. Per-instance
`this.array = []` allocation goes away; ~79 k PDFArrays stop
allocating per-instance backing arrays + grow doublings.

Storage is a plain heterogeneous JS Array -- slots hold the
original PDFObject references, reads are `arrayMain[start + i]`
with no decode. This is the explored-but-didn't-ship Phase 3
shape minus the Float64Array encoding (which cost ~300 ms of
decodeValue dispatch on save's copyBytesInto across ~3 M slots).
The plain-reference shape skips that entirely.

Parser uses a per-parser _arrayTemp + length cursor as a recursion
stack, parallel to fast-dict-onebuf's _dictTemp; each parseArray
invocation pushes onto temp, commits its frame to arrayMain in one
contiguous append, and pops temp back. Dict / array temps are
independent so cross-recursion is fine.

Mutations: in-place replace for set, in-place extend at HWM for
push, COW for insert / remove / push-not-at-HWM. Same at-HWM
safety logic as fast-dict-onebuf; no owned bit needed.

Bit budget: 24-bit start (16 M slots) + 16-bit length (65 536
elements, max observed ~25 k on the book) = 40 bits, well within
Number.MAX_SAFE_INTEGER.

Singleton context is duplicated (10 lines) rather than shared with
fast-dict-onebuf -- each shim stays independently injectable.

Production wiring:
- docs/render-book.mjs: import setExpectedArraySlots; call after
  measure-pass before PDFDocument.load.
- perf/measure.mjs: --fast-array-onebuf flag. Composes with
  --fast-dict-onebuf. --measure-pass also drives
  setExpectedArraySlots when both shims are on.
- perf/README.md: --fast-array-onebuf in both canonical commands,
  flag-rationale entry, run.bat row, What-shipped + Investigation-log.

Heap impact (process phase, 512 B sampling, fast-array-onebuf on
vs the immediate predecessor baseline):
  total sampled    65.6 MB -> 51.9 MB  (-13.7 MB, -20.8 %)
  parseArray row   19.6 MB -> 0        (out of top 15)
  new shim row     -        -> 4.2 MB  (PDFArray wrappers)

CPU impact (process wall, pinned 0x5500 / High, no profiler, 3 paired runs):
  P1 only        median 1.07 s, mean 1.09 s
  P1 + this      median 1.02 s, mean 1.01 s
  Δ              mean +0.08 s (this shim slightly faster, within noise)

The CPU regression that showed up under --cpu-profile-process was
profiler-induced noise; gone once we pin and drop the sampler.

Cumulative heap arc since Map-backed PDFDict: 152 MB -> 52 MB
(-66 %). The endpoint of the dict + array allocator refactor.
Upstream caches `<obj> <gen> R` on each PDFRef so toString /
sizeInBytes / copyBytesInto can read it back. After fast-array-onebuf
shipped, the heap profile showed PDFParser.parseIndirectObjectHeader
at 13.7 MB / 25 % of total -- attribution chain (via
perf/find-heap-callers.mjs):

  parseIndirectObjectHeader  → skipJibberish  (14.2 MB)
    → matchIndirectObjectHeader (try/catch wrapper)
      → parseIndirectObjectHeader  → fastOf

skipJibberish runs after every successful indirect object parse and
speculatively calls matchIndirectObjectHeader to detect the next
`N M obj` header. On valid PDFs it always succeeds. fastOf fires
once per indirect-object boundary, populating the dense-array cache;
the subsequent "real" parseIndirectObject is a cache hit. V8 inlines
fastOf at this call site (small + hot from speculation), so the
attribution lands on the caller -- 13.7 MB of which was the
tag-string churn (`objectNumber + ' 0 R'`): V8 builds 1-2 intermediate
concat strings + the final ~25-35 B tag, ~150 k times.

Eliminating the field collapses both rows:

  parseIndirectObjectHeader  13.7 MB -> 9.3 MB  (-4.3 MB)
  fastOf (refs)               7.7 MB -> 4.8 MB  (-2.9 MB)
  total sampled              51.9 MB -> 45.2 MB (-6.7 MB, -13 %)
  parseArray row             gone (already collapsed by fast-array-onebuf)

The remaining 9.3 MB at parseIndirectObjectHeader and 4.8 MB at
fastOf are the PDFRef instances themselves (Object.create +
objectNumber + generationNumber fields, ~32-48 B × ~150 k) plus
attribution leakage from V8's inlining. Hard floor without dropping
per-PDFRef wrappers entirely.

Prototype methods now compute results from objectNumber /
generationNumber directly:
- copyBytesInto: writes digits straight into the output buffer with
  a no-allocation _writeUint helper (divide-and-write-backwards into
  the caller's buffer). No copyStringIntoBuffer call.
- sizeInBytes: returns digitCount(obj) + digitCount(gen) + 3.
- toString: builds on demand. Debug only, no caching needed.

Both gen=0 (no tag set) and gen!=0 (tag set by upstream's
constructor but ignored) work; the gen!=0 path's tag string is
allocated-then-wasted (~18 % of refs, ~1 MB), not worth patching
the constructor to avoid.

CPU impact (pinned 0x5500 / High, no profiler, 4 runs each):
  with-tag    median 1.045 s, mean 1.045 s
  tagless     median 1.030 s, mean 1.030 s
  Δ           ~15 ms tagless faster (in the noise but trending)

Output PDF is byte-identical modulo /CreationDate + /ModDate
timestamps (verified by inflating + diffing all 453 ObjStm streams).
On valid PDFs, the byte after each successful parseIndirectObject
+ skipWhitespaceAndComments is almost always a digit -- the start
of the next `N M obj` header. skipJibberish only exists to recover
from invalid PDFs that wedge garbage between indirect objects, but
its hot path runs unconditionally: ~150 k calls per load on the
book, each speculatively trying matchKeyword(xref/trailer/startxref)
(all fail on a digit) and then matchIndirectObjectHeader (a
try/catch around parseIndirectObjectHeader → parseRawInt × 2 →
matchKeyword('obj') → fastOf round-trip). The speculation succeeds
every time, the cursor gets rewound, and the outer while loop's
IsDigit check confirms what the speculation already proved.

Short-circuit when the cursor is on a digit; fall through to
skipJibberish on anything else (xref / trailer / startxref keyword
starts, or real jibberish between indirect-object sections).

The once-per-section skipJibberish in parseDocumentSection (after
maybeParseTrailer) is unaffected -- it handles boundaries between
PDF revisions / EOF where stray bytes are spec-legal.

Wall-clock impact (pinned 0x5500 / High, no profiler, 4 paired runs):

  without fast-path  median 1.07 s, mean 1.053 s
  with    fast-path  median 0.995 s, mean 0.985 s
  Δ                  ~67 ms faster (mean), ~6 % of process phase

Phase breakdown isolates the win to load (mean 0.518 → 0.455 s,
-62 ms); save is flat as expected (fast-path is load-side only).

Heap unchanged (0 MB delta, as predicted) -- the PDFRef instances
the speculation allocated were already attribution-shifted to the
real parseIndirectObject's cache miss, not new allocations.

Output PDF byte-identical to the pre-patch baseline (verified by
inflating + diffing all 453 ObjStm streams modulo timestamps).
… process).

fast-refs builds PDFRef instances via
`Object.create(PDFRef.prototype) + fresh.objectNumber = ... +
fresh.generationNumber = ...`. V8 transitions the hidden class on
each property write and routes the result through the slow-property
path. Empirically on the book that's ~60 B per pooled PDFRef, vs
~31 B for PDFName (built via `new PDFName(...)` -- a real
constructor with stable hidden class from the first instance).

This shim swaps `Object.create + writes` for a plain function used
as a constructor that sets both fields in one shot. Aliasing
`_FastRef.prototype = PDFRef.prototype` keeps `instanceof PDFRef`
satisfied and resolves all prototype methods on the shared
prototype (no extra proto-chain hop). gen != 0 still falls back to
upstream PDFRef.of's Map-based pool (rare on freshly-parsed PDFs).

Measured on the book (paired heap + cpu profile, --fast-refs vs
--fast-refs-class with the rest of the production shim set on):

  Heap (sampled total)            45.26 MB -> 41.39 MB  (-3.87 MB, -8.5 %)
  fastOf / fastClassOf row         4 696 KB ->  3 435 KB (-1 261 KB)
  create (builtin)                 3 379 KB ->  2 627 KB (-  752 KB)
  parseIndirectObjectHeader row    9 115 KB ->  7 435 KB (-1 680 KB)

Per-PDFRef savings: ~16 B/instance × 226 k unique = ~3.7 MB.
Not the full 30 B-to-PDFName-floor (PDFRef carries 2 fields vs
PDFName's 1), but a clean win and the construction-style change
applies symmetrically to the other Object.create-built shapes
(fast-dict-onebuf._makeFromRange, fast-array-onebuf._makeFromRange)
for the next round.

  Process wall-clock              1.13 s -> 0.99 s  (-140 ms, -12 %)
    load                          0.52 s -> 0.47 s
    save                          0.51 s -> 0.44 s
  fastOf (PDFRef) self-time       28 ms  -> (out of top 15)

GC self-time barely moved (87 ms -> 82 ms), consistent with the
allocation-rate drop being modest relative to mark-cost (the live
fast-dict-onebuf mainBuf still dominates the GC bill).

fast-refs.mjs stays in the tree as an A/B baseline. measure.mjs
mutex-checks --fast-refs and --fast-refs-class so the wrong one
can't be loaded silently. render-book.mjs swaps the import:
production runs through fast-refs-class now.
Same shape change fast-refs-class applied to PDFRef, now applied to
the PDFDict factory paths. Replaces
`Object.create(ProtoClass.prototype) + pd.d = ... [+ pd.normalized
= false + pd.autoNormalizeCTM = true]` in _makeFromRange and clone
with one plain-function constructor per subclass (`_FastDict`,
`_FastCatalog`, `_FastPageTree`, `_FastPageLeaf`), each with the
right field assignments in its body and its prototype aliased to
the upstream prototype so instanceof + method dispatch are
unchanged. Unknown subclasses fall back to the original
Object.create path (defensive; nothing in our pipeline hits it).

Measured on the book (paired heap + cpu profile, fast-refs-class
baseline vs + this change):

  Heap (sampled total)            41.39 MB -> 35.41 MB  (-5.98 MB, -14.4 %)
  _makeFromRange (dict)           16 484 KB -> 11 404 KB (-5 080 KB)
  create (builtin)                 2 627 KB ->    921 KB (-1 706 KB)
  _FastDict (new attribution row)        — ->    621 KB

Per-PDFDict saving: ~20 B/instance × 260 k = ~5.2 MB. Matches the
delta on the row plus the builtin's drop minus the new
constructor-frame attribution. Cumulative since fast-refs-class:
total sampled 45.26 MB -> 35.41 MB = -9.85 MB (-22 %) over two
shape-change commits.

CPU is roughly flat (process 0.99 s -> 1.03 s under cpu profile,
within noise). GC self-time +18 ms (82 -> 101 ms), consistent with
the existing fast-dict-onebuf trade-off documented in
perf/README.md: the dominant GC cost is the live mainBuf scan, not
allocation rate, so cutting allocation doesn't cut mark time. The
allocation-rate reduction still matters for sustained-load memory
pressure even when it doesn't move single-shot wall-clock.

Output PDF byte-identical modulo /CreationDate + /ModDate
timestamps (no content path touched, only the JS object shape used
to wrap the parsed dict range).
…130 ms process).

Same shape change applied to PDFArray's factory paths -- replace
`Object.create(PDFArray.prototype) + pa.d = ...` in _makeFromRange
and clone with a `_FastArray` plain-function constructor, prototype
aliased to PDFArray.prototype. No subclass dispatch needed
(PDFArray has none in pdf-lib, unlike PDFDict).

Measured on the book (paired heap + cpu profile, prior commit's
dict-class baseline vs + this change):

  Heap (sampled total)            35.41 MB -> 33.68 MB  (-1.73 MB, -4.9 %)
  fastParseArrayOneBuf row         4 372 KB ->  3 334 KB (-1 038 KB)
  create (builtin)                   921 KB ->   (out of top 15) -921 KB
  Process wall-clock               1.03 s ->  0.90 s  (-130 ms, -13 %)
  GC self-time                    100.90 ms -> 58.69 ms (-42 ms)

Per-PDFArray saving: ~22 B/instance × ~80 k = ~1.7 MB. Matches the
row delta + builtin drop.

Surprise win on GC + wall-clock: cuts GC self-time 42 ms despite a
much smaller allocation drop than fast-refs-class or
fast-dict-onebuf. The likely reason is that with all three
shape-changes in place, V8 sees fully monomorphic call sites for
PDFRef / PDFDict / PDFArray construction and method dispatch --
before the array change there was still one slow-property shape in
the mix dragging IC perf. Confirmed by the cumulative process-time
arc:

  baseline (fast-refs)                 1.13 s    87 ms GC
  + fast-refs-class                    0.99 s    82 ms GC
  + fast-dict-onebuf class shape       1.03 s   101 ms GC
  + fast-array-onebuf class shape      0.90 s    59 ms GC

The dict-only state had a slight CPU regression (+40 ms vs
refs-class) that the array change undid and then some. Ship the
combo, not just the two big-row ones.

Cumulative across the three commits in this round (baseline ->
array-class):

  Process wall-clock              1.13 s   -> 0.90 s    (-230 ms, -20 %)
  Total sampled heap              45.26 MB -> 33.68 MB  (-11.58 MB, -25.6 %)
  GC self-time                    87 ms    -> 59 ms     (-32 %)
… round.

The three commits that just landed (fast-refs-class, fast-dict-onebuf
class shape, fast-array-onebuf class shape) did the same shape
change to PDFRef, PDFDict, and PDFArray's factory paths -- swap
`Object.create(proto) + property writes` for a plain-function
constructor whose body assigns all fields in one shot, with the
prototype aliased so `instanceof` + method dispatch stay
unchanged. V8 gives the new instances a stable hidden class from
the first one and per-instance heap cost drops from ~60 B to
~44 B.

Each commit's narrative landed in notes/08-pdf-lib.md per the
staging convention (perf/README.md stays light/operational; per-
shim story lives in notes/). This commit folds in the closing
context that ties the round together:

- Per-instance savings table across all three wrappers in one
  view.
- Investigation aside: how the `parseIndirectObjectHeader` 9 MB
  heap row was a V8 inlining-attribution artifact for fastOf's
  allocations downstream, confirmed via `node --no-turbo-inlining`
  paired run. The hand-inlined `fast-pioh.mjs` attempt was
  deleted after proving the negative; call-counting
  instrumentation in perf/instrument-pioh.mjs (arrives in the
  next commit).
- Caveats: singleton subclass set in fast-dict-onebuf's dispatch,
  shared prototype semantics for `instanceof` / method dispatch,
  residual polymorphism on the gen != 0 PDFRef fallback.
- Top-5 heap rows post-round + the next-step menu (wrapper
  elimination vs targeted smaller shrinks). The big rows are now
  at the per-instance floor for V8 objects with 1-2 inline fields.

No code changes.
…shape round).

Both scripts informed the constructor-shape PDFRef / PDFDict /
PDFArray work that landed earlier in this round; capture them in
the tree so future "is row X a real hot spot or a labelling
artifact" investigations can reuse the pattern.

`instrument-pioh.mjs` -- wraps PDFParser.parseIndirectObjectHeader +
matchIndirectObjectHeader with counters, reports per-load call
counts and kept-heap delta. Built to answer the prerequisites
before committing to inline parseIndirectObjectHeader (was the
function actually a hot spot? was fast-sync-load's digit
short-circuit firing? was speculation throwing?). Output on the
book: 226 k pioh calls, 0 throws, 0 mih calls, ~35 MB kept heap.
The 9 MB self-attribution turned out to be V8 inlining fastOf into
the caller frame -- script informed the correct attack surface
(wrapper construction, not the parser body).

`instrument-objclasses.mjs` -- two views of "how many PDF*
wrappers do we build per load":
- `.of()` call counts for the pooled / factory-method classes
  (PDFRef 1.43 M calls / 226 k unique, PDFName 1.68 M / 4.8 k,
  PDFNumber 284 k / 16 k, PDFString 7.4 k, PDFRawStream 2 k);
- post-load walk of PDFContext.enumerateIndirectObjects() bumping
  per-runtime-class counters for the top-level shapes (PDFDict
  221 k top-level / ~261 k incl. nested; 1 PDFCatalog; 238
  PDFPageTree; 1 651 PDFPageLeaf).

Used to size the constructor-shape attack: confirmed PDFRef +
PDFDict + PDFArray dominate the per-instance × instance-count
product, so they were the right three to convert.

Both scripts updated to import fast-refs-class (current
production) rather than the older fast-refs.mjs (now A/B
baseline) so the numbers reflect production.

Also adds README entries under "What's in this folder", next to
the existing pdf-lib-side standalone harnesses
(profile-load.mjs, profile-roundtrip.mjs).
…ss, -9 %).

PDFObjectParser.prototype.parseName fires 1.68 M times per load on
the book and the call density was the twinbasic#1 row in the process CPU
profile after the constructor-shape round closed
(PDFObjectParser.parseName @ ~87 ms self + fastOf @ ~57 ms callee
= ~144 ms combined, ~16 % of process). 4 787 of those calls are
unique; the other 99.7 % hit the same handful of dict keys (Type,
Length, Pages, MediaBox, ...) over and over. The per-call work --
build a string via `name += charFromCode(byte)` then hand it to
PDFName.of's string-keyed Map -- is pure overhead on the hot path:
the answer was already cached, we just kept rebuilding the key.

A failed first attempt (the v1 of this shim, not committed) tried
to keep the cons-string accumulator but skip the per-byte
this.bytes.peek/.next/.done method dispatch. CPU didn't move:
V8 had already optimised the cons-string path well and the
saved method-call cost just shifted to attribution on the callers
(fastParseDictOneBuf / fastParseObject) under inlining. A second
failed sketch built the lookup string via
`String.fromCharCode.apply(null, buf.subarray(start, idx))` and
was SLOWER than upstream (~123 ms vs ~87 ms): .apply on a typed
array is a deopt path in V8.

This shim attacks the real surface: avoid producing the lookup
string at all on the hot path. Scan bytes with direct buffer
access, compute a Java-style `hash * 31 + byte` Smi hash in the
same pass, look up `Map<hash, Entry | Entry[]>` keyed by byte
content. Single-entry buckets (the vast majority -- 4 787 unique
names into 2^32 hash space gives essentially zero collisions)
store the Entry directly; collision buckets get promoted to an
Entry[] for linear scan. Entry carries a small Uint8Array copy of
the name body for exact equality check.

Cold path: byte-cache miss. Build the lookup string in one shot
(String.fromCharCode is fast for short ranges via direct args, no
.apply), call the upstream `PDFName.of` -- which on this stack
means fast-decode-name's string-keyed Map -- and cache the
returned PDFName in the byte-cache for next time. Both caches
converge on the same PDFName per logical name; direct
PDFName.of(...) calls from non-parser code (setOutline,
setMetadata) bypass the byte-cache and go straight through
fast-decode-name -- correct, no byte range available.

Measured on the book (paired heap + cpu profile, baseline =
prior --fast-refs-class + --fast-dict-onebuf + --fast-array-onebuf
state, this on top):

  Process wall-clock              0.90 s -> 0.82 s   (-80 ms, -9 %)
    load                          0.41 s -> 0.33 s
    save                          0.42 s -> 0.42 s
  parseName + fastOf (combined)   144 ms -> 58 ms    (-86 ms)
  PDFObjectParser.parseName       (gone from top 15)
  fastOf (PDFName decode-name)    52 ms  -> (gone from top 15)
  Heap (sampled total)            33.68 MB -> 34.98 MB (+1.30 MB)
    new fastParseName row          —     -> 1 269 KB (the cache)
    set (builtin)                  624 KB -> 852 KB (+228 KB)

Heap-profiled process wall-clock dropped much more (3.50 s ->
2.56 s, -940 ms) than the cpu-profiled run did -- because the
sampler's per-allocation bookkeeping is the dominant cost under
512 B sampling, and we just eliminated ~1.6 M transient string
allocations that were all under the sample threshold (so they
don't show up in the heap row, only in the sampler's wall-clock
overhead). Read the cpu number for "did we get faster"; read the
heap-row delta for "what's the long-lived cost" (+1.3 MB, the
cache itself).

GC self-time +21 ms on the cpu run (the live byte-cache adds to
mark cost), more than offset by the -80 ms parseName savings.

Output PDF byte-identical: the byte-cache and the string-cache
return the same PDFName instance per logical name, so all
downstream code sees the same identity.
d layout shifts from start[0:23] / length[24:37] to start[0:22] /
gap[23:24] / length[25:40], freeing bits 23 and 24 for PDFPageLeaf's
normalized + autoNormalizeCTM. _FastPageLeaf collapses to a single d
field; the booleans become prototype getters/setters that mask in/out
of d. start drops 24 -> 23 bits (8.4 M slots, still well above the
~2.3 M mainLen on the book); length grows 14 -> 16 bits (65 535,
ample headroom over the 8 706 observed max).

Heap saving on the 1 651 page leaves is sub-row at the 512 B sampler
resolution but real (~26 KB). Output byte-identical to baseline. CPU
flat (no PDFPageLeaf mutation paths fire on the book).
Split _FastRef into two constructors: _FastRef(objectNumber) for the
gen=0 path (every PDFRef on fresh-Chrome workloads) carrying a single
inline data field, and _FastRefGen(objectNumber, generationNumber)
for the rare gen!=0 path (the xref free entry at object 0). A
default `generationNumber = 0` on PDFRef.prototype supplies the
missing field for _FastRef instances via prototype lookup, so reads
of either property stay as plain data-property loads -- no accessor-
property boundary that would deopt the IC at upstream call sites
(PDFCrossRefSection.append, PDFCrossRefStream entry tuples,
PDFWriter.serializeToBuffer, fast-indirect-objects, ...).

Per-instance: 24 B (two slots) -> 16 B (one slot). On 226 k unique
PDFRefs, ~1.88 MB heap saved (paired heap profile: 34.96 MB ->
33.08 MB total sampled). CPU min flat (no-profile wall-clock 0.70 s
vs ~0.83 s pre, but the heap-saving lane isn't the source of CPU
movement). PDF output byte-identical.

An accessor-property variant (single packed d + getters for
objectNumber/generationNumber) was tried first and rejected: it
regressed heap by +1.6 MB and CPU by +70 ms because the getter
dispatch broke V8's monomorphic ICs at the upstream xref-write
sites, forcing recompilation paths that couldn't elide the
{ref, offset, deleted} object literals in PDFCrossRefSection.addEntry
as aggressively.
Companion to analyze-heap.mjs / find-heap-callers.mjs. Prints every
match for a substring on functionName, with the matched frame's
self-size and each direct child's self + descendant total.

Built during the PDFRef class-shape round to investigate why
`maybeParseCrossRefSection` showed 3.4 MB self with <40 KB worth of
named children -- a V8 inlining-attribution case where the parent
frame's compiled code absorbed `PDFCrossRefSection.addEntry`'s
object-literal allocations. The flat top-15 view can't tell you that;
this script can.
@KubaO KubaO merged commit e3f4b40 into twinbasic:main May 24, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant